A4 Llama 3.1 70B recipe on NeMo 2.0 with GCSFuse storage #37

base: main

Conversation
> ```
> cd $REPO_ROOT/src/utils/checkpointing_metrics
> python3 calculate_checkpoint_metrics.py --gcs_logs_path=${GCS_LOGS_PATH}
> ```
Did you test this? I'm not sure if it has been updated to work with NeMo 2.
removed this.
Let's update the file path to match the other recipes in this directory with a "-gcs" suffix.
Done
> ```
> @@ -0,0 +1,303 @@
> <!-- mdformat global-off -->
> # Pretrain llama3-1-70b-gpus128 workloads on a4 GKE Node pools with Nvidia NeMo Framework using Google Cloud Storage for training data and checkpoints
> ```
A4
Is there supposed to be 2 spaces here?
> ### Configure and submit a pretraining job
>
> #### Using 16 node (64 gpus) fp8 precision
Is it fp8 or bf16?
bf16. Updated.
> ### Analyze results
>
> When completed, the job creates several artifacts, including logs and traces, and places them
> in the Google Cloud Storage logs bucket as follows:
>
> ```
> gs://${GCS_BUCKET_LOGS}/nemo-experiments-storage/<JOB_ID>
> ├── nemo-configuration.yaml
> ├── lightning_logs.txt
> ├── nemo_error_logs.txt
> ├── nemo_log_globalrank-[RANK]_localrank-[LOCAL].txt
> ├── dllogger
> │   ├── rank-0
> │   │   ├── dllogger.json
> ...
> ```
>
> - `nemo-configuration.yaml`: the NeMo configuration used by the pretraining script. This includes
>   the combined [configuration file](../16node-bf16-seq8192-gbs512/llama3-1-70b.py)
>   and the command line overrides
> - `lightning_logs.txt`: the log files generated by PyTorch Lightning, which is used by NeMo
> - `nemo_error_logs.txt`: the warning and error logs generated by NeMo
> - `nemo_log_globalrank-[RANK]_localrank-[LOCAL].txt`: the NeMo logs for each rank
> - `dllogger/`: the log captured by [NVIDIA DLLogger](https://github.com/NVIDIA/dllogger).
>   DLLogger is configured to store logs on the rank 0 node. The log is in JSON format
>   and includes loss, step_time, and other key metrics for each training step
>
> The `<JOB_ID>` has the following format:
>
> - `$USER--llama31-70b-gcs-[YYYY]-[MM]-[DD]-[hh]-[mm]-[ss]`, where the suffix of the ID is the day and time when the job was started.
>
> The NeMo log files include information about checkpoint operations on each rank. You can use the [checkpointing_metrics](../../../../src/utils/checkpointing_metrics) utility to calculate statistics for checkpoint write times.
>
> To calculate statistics:
>
> 1. Set a path to the NeMo logs.
>
>    ```
>    export JOB_ID=<JOB_ID>
>    export GCS_LOGS_PATH="gs://${GCS_BUCKET_LOGS}/nemo-experiments-storage/${JOB_ID}"
>    ```
>
>    Replace `<JOB_ID>` with the ID of your job.
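For quick inspection of per-step metrics, something like the following works against the rank-0 DLLogger output described above. This is a minimal sketch: it assumes `gsutil` access to the logs bucket and DLLogger's default JSON stream format, where each line carries a `DLLL ` prefix before the JSON payload; the `step_time` key comes from the artifact description above.

```bash
# Print the step_time recorded for each training step from the rank-0
# DLLogger output. The "DLLL " prefix is an assumption based on
# DLLogger's default JSON stream backend; adjust the pattern if your
# log lines differ.
gsutil cat "${GCS_LOGS_PATH}/dllogger/rank-0/dllogger.json" \
  | sed 's/^DLLL //' \
  | grep -o '"step_time": *[0-9.]*'
```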
This section seems to end abruptly. I'm ok if we don't want to update the checkpoint metrics utility, but we should at least tell users where they can find this data in the logs.
Done
> - file-cache:enable-parallel-downloads:true
> - file-system:kernel-list-cache-ttl-secs:0
> - write:enable-streaming-writes:true
> - machine-type:a3-highgpu-8g
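For context, mount options like these belong in the GCSFuse PersistentVolume spec. A sketch of how they might appear in `src/helm-charts/storage/gcs-fuse/templates/pv.yaml` (the PV name, capacity, storage class, and bucket name here are placeholders, not values from this recipe; only the mount options quoted above come from it):

```yaml
# Minimal GCSFuse PersistentVolume sketch for GKE's Cloud Storage FUSE
# CSI driver. Placeholder values are marked in comments.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: training-data-pv        # placeholder name
spec:
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 64Gi               # placeholder; not enforced by the driver
  storageClassName: gcs-fuse    # placeholder class name
  mountOptions:
    - implicit-dirs
    - file-cache:enable-parallel-downloads:true
    - file-system:kernel-list-cache-ttl-secs:0
    - write:enable-streaming-writes:true
  csi:
    driver: gcsfuse.csi.storage.gke.io
    volumeHandle: my-training-bucket   # placeholder bucket name
```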
TODO: Add a comment noting the GCSFuse version in which the fix landed, and which versions still require this workaround.
"These gcsfuse versions" -> "Earlier GCSFuse versions"
> Replace `<JOB_ID>` with the ID of your job.
>
> The NeMo log files include information about checkpoint operations on each rank. Users can find checkpoint read and write informatiom in `nemo_log_globalrank-[RANK]_localrank-[LOCAL].txt` files.
"information"
Can we just tell them to check rank 0?
Done
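As suggested, checking only the rank 0 log is usually enough. A minimal sketch, assuming `gsutil` access and that the relevant log lines contain the word "checkpoint" (the grep pattern is a heuristic, not a documented format):

```bash
# Pull checkpoint-related lines from the rank-0 NeMo log. The file name
# follows the nemo_log_globalrank-[RANK]_localrank-[LOCAL].txt pattern
# from the artifacts tree above; adjust the grep pattern to match your
# actual log lines.
gsutil cat "${GCS_LOGS_PATH}/nemo_log_globalrank-0_localrank-0.txt" \
  | grep -i "checkpoint"
```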
> ```yaml
> envs:
>   - name: GLOO_SOCKET_IFNAME
>     value: eth0
> gcsSidecarImage: gcr.io/gcs-tess/ashmeen/gcs-fuse-csi-driver-sidecar-mounter:v3.2.0_test
> ```
We shouldn't specify the sidecar image since this image is not publicly available.
> - Kueue and JobSet APIs installed
> - Kueue configured to support Topology Aware Scheduling
> - A regional Google Cloud Storage (GCS) bucket to store logs.
> - A regional Google Cloud Storage (GCS) bucket with [hierarchical](https://cloud.google.com/storage/docs/hns-overview)) namespace to store the Pile dataset
There is an extra ")" on this line.
> - Helm
> - kubectl
>
> *Important: All GCS buckets must be in the same region as the GKE cluster*.
Use two ** to make this bold?
Add a complete Helm chart with a README, and test the scripts.
TODO: update src/helm-charts/storage/gcs-fuse/templates/pv.yaml with a comment on when to add machine-type:a3-highgpu-8g, based on b/450059657#comment27.
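For the bucket prerequisites quoted above, creating both buckets in the cluster's region could look like the following. This is a sketch: the region and bucket names are placeholders, and the flags assume a current `gcloud storage` CLI.

```bash
# Placeholder region and bucket names; keep all buckets in the same
# region as the GKE cluster, per the note above.
REGION=us-central1

# Logs bucket (a flat namespace is fine here).
gcloud storage buckets create gs://my-nemo-logs --location=${REGION}

# Dataset bucket with hierarchical namespace enabled; HNS requires
# uniform bucket-level access.
gcloud storage buckets create gs://my-nemo-data \
  --location=${REGION} \
  --uniform-bucket-level-access \
  --enable-hierarchical-namespace
```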